1. INTRODUCTION

Several public hospitals use an electronic medical record (EMR) system to manage patient data, such
as data on patient care, diagnostic procedures, medical history and allergic conditions [1]-[3]. Owing to the
increasing number of EMRs globally, medical information has become too complex and voluminous to be
processed and analysed using traditional methods. In recent years, the clinical information in EMRs has
increasingly been exploited through data mining, which identifies patterns using a combination of machine
learning and statistics [4]-[6].

Data mining has been practised since its emergence in the late 1980s. It is a cross-disciplinary
process that focuses on exploring very large datasets to uncover patterns [7]. Machine learning, a branch of
artificial intelligence, is closely linked with data mining, its predecessor. Machine learning primarily uses
learning algorithms to identify patterns in sample data and then uses these patterns to make decisions without
being explicitly programmed for this purpose [8].

Machine learning has been widely applied in medical research to mine EMRs for patterns that can aid
medical practitioners, mainly in decision-making [9], [10]. Machine learning techniques include regression,
classification, clustering, dimensionality reduction and reinforcement learning. These techniques are commonly
used as tools for solving classification, forecasting and prediction problems [11].

Clustering, a common machine learning technique, involves grouping objects according to similarities
in their data and assigning each object to a specific group. The algorithm analyses the data, identifies
groups of records that share the same traits and develops a model from these groups [12]. For example,
clustering can be used to group patients according to demographic attributes such as gender, age and blood type.

In this study, we investigate the literature on clustering algorithms to identify challenges associated 
with analysing medical data in recent research. Prior reviews on clustering techniques have paid less attention 
to medical research. Therefore, we present the outcomes of a systematic mapping study to categorise recent 
research on clustering algorithms used to analyse medical data. Through this study, we aim to address this 
research gap by seeking answers to three main research questions: (RQ1) What are the studies that have 
investigated clustering techniques for analysing EMRs? (RQ2) What are the studies that have discussed 
problems associated with clustering techniques for analysing EMRs? (RQ3) What are the studies that have 
used evaluation parameters on clustering techniques? 

Through this systematic mapping study, we provide researchers and practitioners with an overview of 
the existing clustering algorithms for analysing medical data. We also provide practical insights by exploring 
these research questions. Overall, this systematic mapping study contributes to the knowledge base on 
clustering algorithms for analysing medical data. 


2. RELATED WORK 

EMRs are an innovation that eliminates the need for traditional paper charts containing a patient’s
medical history [1]. The use of EMRs enables researchers to gain new insights from large datasets [13].
Mining EMRs is thus another way to acquire new medical knowledge, apart from conducting clinical and
biological experiments. An EMR can store an extensive range of patient information, such as medication
history, vaccinations, test results and insurance information.

EMRs are used to store all past clinical information of each patient in detail during their treatment at 
medical institutions [14]. The incidence and clinical characteristics of diseases can be identified through mining 
all previous clinical information recorded in the EMR system [15]-[17]. Data mining has become an important 
technique to help researchers extract knowledge from large and complex data, such as medical records for 
patients [18]-[20]. 

Data mining is also termed ‘knowledge discovery and data mining’. The data mining process mainly
aims to extract essential and useful information from a database [7]. It involves data pre-processing (data
cleaning, integration, transformation and reduction), pattern discovery and knowledge evaluation [21].
Various data mining techniques, such as association, classification and clustering, can be used to extract
interesting patterns from huge amounts of data.

The first data mining technique is association, a method used to detect interesting relationships
between variables in large databases [22]. In collected medical data, it can help to identify rules relating
entities such as medication, symptoms, disease and health condition [23], [24]. Next, classification is a
method for building a model that describes significant data classes. Examples of classification techniques
include the decision tree algorithm and the Naive Bayes algorithm.

Clustering is another data mining technique. It involves grouping data with identical features into 
classes or clusters [22]. It can be applied in different fields, such as artificial intelligence, pattern recognition 
and neural networks [25]. In addition, clustering algorithms have been widely used for analysing medical data. 
There are several clustering techniques, such as the partitioning algorithm, the hierarchical algorithm and the 
density-based algorithm [26]. 

K-means is a well-known partitioning clustering algorithm commonly used to analyse EMRs. For example,
it has been used to cluster EMRs to identify the relevant features of patients with kidney failure and heart
disease [27]. K-means is the most popular clustering algorithm because of its success in yielding efficient
clusters, its ease of implementation, its improvement of prediction accuracy and its flexibility [28].
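
As an illustration of why k-means is considered easy to implement, the following is a minimal sketch of the
standard k-means iteration (Lloyd's algorithm) in plain Python. It is not taken from any of the reviewed
studies; the function name `kmeans` and the `patients` data, a handful of (age, systolic blood pressure)
pairs, are hypothetical and chosen only for demonstration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical records: (age, systolic blood pressure).
patients = [(25, 110), (30, 115), (28, 112), (65, 150), (70, 155), (68, 148)]
labels, centroids = kmeans(patients, k=2)
```

On this toy data the younger and older patients end up in separate clusters. The number of clusters `k`
must be supplied in advance, which is why it appears as a parameter.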

The hierarchical algorithm is another clustering technique. It is quite popular owing to its visualisation
capability [23]. It focuses on building a cluster hierarchy [29]. Hierarchical clustering groups data by type in a
chain of importance, and it is used for probability analysis in healthcare [27]. This technique is commonly used
for measuring physiological features across all patient medical records [30].

The last clustering technique is density-based clustering. Density-based spatial clustering of
applications with noise (DBSCAN) is one of the best-known density-based clustering techniques. DBSCAN is
efficient at finding clusters, can separate noise from clusters and is robust to outliers [31], [32].
Density-based algorithms are also broadly used on healthcare and medical datasets such as biomedical images.
For example, DBSCAN has been used on skin lesion images to detect homogeneous colour regions. In addition,
DBSCAN has been used to identify populations with dengue fever [29].
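
To make the role of noise handling in DBSCAN concrete, the following is a minimal sketch of the textbook
algorithm in plain Python. It is an illustration under simple assumptions (Euclidean distance, brute-force
neighbour search), not the implementation used in the cited studies; the names `dbscan`, `eps`, `min_pts`
and the `samples` data are hypothetical.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Brute-force search: every point within distance eps (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels


# Two dense groups and one isolated point (hypothetical 2-D feature values).
samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
labels = dbscan(samples, eps=1.5, min_pts=3)
```

The isolated point receives the label -1 instead of being forced into a cluster, which is the behaviour
that makes the method robust to outliers. Unlike k-means, the number of clusters is not supplied; it follows
from the neighbourhood radius `eps` and the density threshold `min_pts`.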


3. METHOD

A systematic mapping study is a well-defined methodology for identifying and reviewing all available
research evidence relevant to a topic. It is used to provide clear answers to specific research questions,
which should themselves have been considered in the literature. In this mapping study, we adopted the
guidelines suggested by researchers in software engineering [33], [34].

We categorised and structured empirical evidence based on studies that discussed the clustering 
algorithms used to analyse medical data. Thus, this study will help researchers and practitioners to gather, 
classify and aggregate results in related studies to identify research gaps and challenges for improvement in 
the body of knowledge. The mapping process consists of five activities: defining research questions, 
implementing the search strategy, selecting studies, keywording of abstracts and extracting data [35], [36]. 


3.1. Research questions 

Framing a clear, concise research question is a prerequisite for any systematic mapping study. Without
a specific question, writing a well-focused review can be very onerous. For that reason, this study’s
research questions were developed using the PICOC (Population, Intervention, Comparison, Outcome,
Context) model of Kitchenham et al. [34]. We constructed three research questions as follows:

RQ1: What are the studies that have investigated clustering techniques for analysing EMR?

RQ2: What are the studies that have discussed problems associated with clustering techniques for 
analysing EMR? 

RQ3: What are the studies that have used evaluation parameters on clustering techniques? 


3.2. Search strategy 

A thorough search is required for the process of identifying relevant literature. For this reason, a search 
strategy is necessary, which involves identifying the right search string with the relevant terms and keywords. 
This approach will ensure wide coverage and increase the chance of identifying the right publications. 
Therefore, in this mapping study, we constructed the search string as follows: 

— Identify keywords in relevant papers. 

— Search synonyms for identified keywords. 

— Formulate search strings using Boolean OR to include alternative spellings and synonyms, and Boolean 
AND to combine major terms. 
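
The three steps above can be sketched in a few lines of Python. This is only an illustration of how
synonyms are joined with OR and major terms with AND; the helper name `build_query` is hypothetical, and
database-specific filters such as subject area or language limits are deliberately left out.

```python
def build_query(groups):
    """OR the synonyms within each group, then AND the groups together."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Keyword groups used in this study; each inner list holds synonyms.
groups = [
    ["clustering technique", "clustering algorithm", "machine learning", "data mining"],
    ["analyse"],
    ["disease"],
    ["electronic medical record", "EMR", "medical data"],
]
query = build_query(groups)
```

Keeping the keyword groups in a list makes it straightforward to add a synonym or alternative spelling
and regenerate the string for each database.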

The search string: ("clustering technique" OR "clustering algorithm" OR "machine learning" OR
"data mining") AND ("analyse") AND ("disease") AND ("electronic medical record" OR "EMR" OR
"medical data") AND (LIMIT-TO (SUBJAREA, "COMP")) AND (LIMIT-TO (LANGUAGE, "English"))

In the search process, we included papers from conferences and journals on the research topics, such
as the System & Software Journal and the Biomedical Informatics Journal. We limited our search to the
computer science domain. We considered papers published in the five years from 2016 to 2021. We used the
search string to conduct a primary search on SCOPUS, ScienceDirect, SpringerLink and the ACM (Association
for Computing Machinery) Digital Library. The search results are presented in Table 1.


Table 1. Primary search results

Online database   Search results   Duplicate papers   Relevant papers
SCOPUS            92               25
ScienceDirect     329              15                 13
SpringerLink      96               9                  4
ACM Digital       182              1                  6
Total             699              50                 27


3.3. Selection of papers 

The general inclusion criterion for the overall selection process was that the selected studies should be
empirical studies related to clustering algorithms in the domain of medical data. The literature search
covered studies published within the 5-year period of 2016-2021. The specific inclusion criteria were
peer-reviewed papers that (a) provide evidence on analysing and evaluating EMRs using clustering
algorithms; (b) state the evaluation criteria for clustering techniques; and (c) use EMRs. The exclusion
criteria were papers that (a) provide evidence on clustering techniques or algorithms but do not use EMRs
or medical data or (b) are not in English.

On performing the automatic search, we obtained a set of 699 papers. In the screening process, we first
screened the title and abstract of each paper to exclude unrelated and duplicate documents, which
reduced the number of papers to 121. Next, we scanned the introduction and conclusion of each paper, which
reduced the set to 27 papers.


3.4. Classification scheme 

First, we observed the keywords and concepts in the abstract of each paper. Next, we combined the 
keywords and concepts to produce an outline of the issues and challenges of the background research. We then 
selected suitable keywords to structure the schemes or categories. Last, we determined three different 
categories, which are clustering techniques for analysing EMRs (RQ1), problems in clustering techniques 
(RQ2) and evaluation criteria used to evaluate clustering techniques (RQ3). 


3.5. Data extraction and mapping of studies 

In this process, we used an Excel spreadsheet for data logging and output production. We mapped the
identified studies into three categories and obtained relevant information from these to answer the research
questions. We tabulated the results and calculated the publication frequencies in each category.
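
The frequency calculation itself is simple. As a hedged sketch, the same tally that was kept in the
spreadsheet could be reproduced in Python as follows; the paper identifiers and category assignments in
`mapped` are hypothetical placeholders, not the studies reviewed here.

```python
from collections import Counter

# Hypothetical mapping: paper id -> categories assigned during keywording.
mapped = {
    "paper_A": ["partitioning"],
    "paper_B": ["partitioning", "density-based"],
    "paper_C": ["hierarchical"],
}

# Each paper contributes one count to every category it was mapped to.
frequency = Counter(category for categories in mapped.values() for category in categories)
total_counts = sum(frequency.values())
```

Because a paper can be mapped to several categories, `total_counts` can exceed the number of papers,
which is why category totals in a mapping study need not equal the number of selected studies.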


4. RESULTS 
4.1. Clustering techniques for analysing EMR (RQ1) 

Clustering algorithms have been widely used to analyse medical data [31], [37]-[40]. Three clustering
techniques are commonly used for analysing EMRs: partitioning, hierarchical and density-based algorithms.
K-means is a well-known partitioning clustering algorithm that is commonly used for analysing EMRs [37]-[39],
[41]. It has been used to group patient records to identify the relevant features of patients with heart
disease [42], [43]. In addition, some research efforts have been devoted to applying clustering techniques to
data related to kidney failure [27]. K-means clustering has also been used to group patients with
Parkinson’s disease [44].

The hierarchical algorithm is another clustering method discussed in recent studies owing to its
visualisation capability, which is requested by many physicians [26]. Hierarchical clustering aims to construct
a cluster hierarchy and can be regarded as a sequence of partitional clusterings. It groups data in a chain
of importance that is used for prediction in healthcare. Hierarchical clustering is also commonly used for
physiological measurements across all patients [30].
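
The idea of a cluster hierarchy as a sequence of partitions can be illustrated with a minimal agglomerative
(bottom-up) sketch in plain Python. Single linkage is assumed here purely for simplicity; the reviewed
studies may use other linkage criteria. The function name `agglomerative` and the `measurements` values are
hypothetical.

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[i] for i in range(len(points))]  # start from singletons
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x][:], clusters[y][:], d))  # record for a dendrogram
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, merges


# Hypothetical one-dimensional physiological measurements.
measurements = [(1.0,), (1.5,), (5.0,), (5.2,), (9.0,)]
clusters, merges = agglomerative(measurements, n_clusters=3)
```

Each entry in `merges` corresponds to one level of the hierarchy, so stopping at a different `n_clusters`
yields a coarser or finer partition of the same tree. The nested loops compare every pair of clusters at
every step, which hints at why the method becomes costly on large datasets.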

Density-based clustering algorithms are widely used to find clusters of nonlinear and arbitrary shapes
based on the density of the connected points. DBSCAN is one of the best-known density-based clustering
techniques [45]-[47]. Density-based algorithms are broadly used on healthcare and medical records such as
biomedical images [48], [49]. DBSCAN has been used to detect homogeneous colour regions in skin lesion images.
It has also been used to discover hidden clusters and patterns for heart disease [50] and to determine
the population with dengue fever [29].

We found 27 relevant studies, of which only 23 fulfilled the three inclusion criteria. Some of these
included more than one clustering technique; for example, k-means and DBSCAN were both used
in one study [50]. Figure 1 presents the distribution of studies on clustering techniques for analysing EMR.

As Figure 1 shows, the number of studies on clustering techniques for analysing EMRs increased from
2016 to 2021. The highest number of studies on the partitioning clustering technique, seven, was published
in 2019. Meanwhile, six studies were on hierarchical clustering and three on density-based clustering.


4.2. Clustering technique problems for analysing EMR (RQ2) 

There are many problems related to clustering techniques. However, we report only the issues
explicitly mentioned in the papers, without further categorisation. In addition, papers that present
more than one problem were counted in every applicable category, so the total exceeds 27. Table 2
shows the frequency of problems in the clustering techniques.

As previously mentioned, we identified 27 studies that met the inclusion criteria. As illustrated in
Table 2, the partitioning clustering algorithm is not recommended for datasets with noise and many outliers
because it is highly susceptible to noise and outliers with abnormal values [27]. The k-means
partitioning clustering technique is biased towards spherical clusters and is sensitive to the initial
selection of centroids. Moreover, k-means is less efficient in partitioning the initial collection of data
into subsets with different data distributions, such as patients with different examination histories [43].
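
The susceptibility to abnormal values follows from the fact that a k-means centroid is the arithmetic mean
of its members. The short sketch below shows how a single outlier drags the mean; the blood pressure
`readings` are hypothetical, and the median is included only as a point of contrast, not as a method
proposed in the reviewed studies.

```python
from statistics import mean, median

readings = [118, 120, 122, 121, 119]  # hypothetical systolic values of one cluster
with_outlier = readings + [300]       # the same cluster plus one abnormal entry

# How far each summary statistic moves when the outlier is added.
shift_mean = mean(with_outlier) - mean(readings)
shift_median = median(with_outlier) - median(readings)
```

Here the mean moves by 30 units while the median moves by 0.5, so one erroneous record can displace a
centroid far enough to change which patients are assigned to the cluster.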

Owing to space and time constraints, the hierarchical algorithm is sometimes not suitable for large
datasets. If no data or only limited data are available, hierarchical clustering may be a suitable option
because the number of clusters need not be predetermined. However, hierarchical clustering is computationally
expensive and not scalable; therefore, it is not suitable for big data [29]. In addition, it is hard to
detect the gaps between the objects.


Figure 1. Distribution of studies on clustering techniques for analysing EMR


Density-based clustering is slow for large datasets. DBSCAN can find clusters of different sizes and
shapes, but it is weak at identifying clusters of varying density [42]. This limitation can be overcome by
decomposing the clustering process into successive steps with the multiple-level DBSCAN algorithm. Another
issue is that the DBSCAN algorithm does not track cluster centres and cannot simply be trained and then
applied to new data [32].


Table 2. Distribution of papers on clustering technique problems

Clustering technique       Problem                                                                       No. of papers
Partitioning clustering    P1: Highly susceptible to noise and outliers with abnormal values.            10
                           P2: Biased to spherical clusters.                                             4
                           P3: Sensitive to the initial selection of centroids.                          2
                           P4: Less efficient in partitioning initial collection of data into subsets.   2
Hierarchical clustering    H1: Not suitable for big data owing to space and time limitations.            3
                           H2: Expensive and not scalable.                                               3
                           H3: Hard to detect the gaps between the objects.                              1
Density-based clustering   D1: Slow for large datasets.                                                  3
                           D2: Weak in identifying clusters of varying density.                          1
                           D3: Unable to track cluster centres.                                          2
                           D4: Cannot be simply trained and utilised with new data.                      1
Total                                                                                                    31


4.3. Evaluation parameters used in clustering techniques (RQ3) 

The clustering algorithms in the reviewed studies are analysed based on certain evaluation parameters,
such as the number of clusters, cluster distance, epsilon value and threshold value. Table 3 shows the
distribution of relevant papers for the common evaluation parameters used in 21 studies.

As Table 3 shows, several evaluation parameters are used in clustering techniques. The most used
evaluation parameter is the number of clusters: nine papers applied this parameter to evaluate clustering
performance. Two papers each used the sum of squared error, silhouette score, overall similarity,
cluster density and threshold value as an evaluation parameter of the clustering technique. Other
parameters used to evaluate the clustering technique are cluster distance, the Davies-Bouldin
Index, epsilon value and neighbourhood size.
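
For readers unfamiliar with these measures, the sketch below computes two of them, the sum of squared
error and the mean silhouette score, using their standard textbook definitions. It is illustrative only:
the function names `sse` and `silhouette` and the toy data are hypothetical, Euclidean distance is assumed,
and the silhouette function assumes at least two clusters of distinct points.

```python
def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (assumes two or more clusters)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette defined as 0
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical one-dimensional data already split into two clusters.
data = [(0,), (2,), (10,), (12,)]
data_labels = [0, 0, 1, 1]
data_centroids = [(1,), (11,)]

error = sse(data, data_labels, data_centroids)
score = silhouette(data, data_labels)
```

A lower sum of squared error indicates tighter clusters, whereas a silhouette score close to 1 indicates
clusters that are both compact and well separated. Parameters such as epsilon and neighbourhood size are
inputs to density-based algorithms rather than quality scores, so they are not computed here.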


4.4. Mapping


To provide an overview of research on clustering algorithms for analysing medical data, we
present a map in Figure 2 that depicts the distribution of papers according to the techniques, problems
and common evaluation parameters they address. The map presents the papers as a bubble plot, in which
the size of each bubble and the number inside it represent the frequency of papers classified in that
category. Because a paper may contribute more than one technique, problem or evaluation parameter, such
papers are counted under more than one factor in each category. For that reason, the total number of paper
counts in the bubble plot is not equal to the total of 27 relevant papers.


Table 3. Distribution of papers on evaluation parameters

No.   Evaluation parameter    No. of papers
1.    Sum of Squared Error    2
2.    Silhouette Score        2
3.    Overall Similarity      2
4.    Cluster Density         2
5.    Cluster Distance        1
6.    Davies-Bouldin Index    1
7.    Epsilon Value           1
8.    Neighbourhood Size      1
9.    Number of Clusters      9
10.   Threshold Value         2
Total                         21

Figure 2. Distribution of clustering algorithms research by techniques, problems and evaluation parameters


5. DISCUSSIONS 

In this study, we aimed to determine the status of recent research on clustering algorithms for 
analysing medical data. We performed a mapping study to identify the clustering techniques used, associated 
problems and the evaluation parameters used in these techniques and found 27 such studies. Almost all the 
reviewed papers were published between 2016 and 2021. 

First, this study highlights the clustering techniques used to analyse EMRs. We found that 23 studies
applied clustering techniques for analysing EMRs, namely partitioning, hierarchical and density-based
algorithms. Among these techniques, the partitioning technique is the most popular. Some studies applied
more than one clustering technique for data analysis. Specifically, 18 studies used partitioning clustering
to cluster and analyse medical data, followed by five studies that used hierarchical clustering and three
that used density-based clustering.

Next, this study emphasises the problems involved in these clustering techniques. Partitioning
clustering is not recommended for noisy datasets because it is highly susceptible to noise. Hierarchical
clustering is sometimes not suitable for big data because it is expensive and not scalable. Meanwhile,
density-based clustering is weak at identifying clusters of varying density; it also fails to track cluster
centres and cannot simply be trained and then applied to new data.

Last, this study also highlights the evaluation parameters used in clustering techniques. Clustering
techniques are analysed based on certain evaluation parameters. We identified 10 evaluation parameters: sum
of squared error, silhouette score, overall similarity, cluster density, cluster distance, the Davies-Bouldin
Index, epsilon value, neighbourhood size, number of clusters and threshold value. We found that the number
of clusters is the most popular evaluation parameter, with nine studies using it.


6. CONCLUSION 

This systematic mapping study has provided an in-depth review of clustering algorithms used for
analysing medical data. A limitation of the review is that, in accordance with the selection criteria, we
included only studies that clearly examined clustering algorithms for analysing medical data and
excluded studies that described clustering algorithms but did not use EMRs or medical data. The results show
that three main types of clustering techniques are used for analysing EMRs: partitioning, hierarchical and
density-based clustering. Based on the mapping results, the problem of clustering techniques most often
addressed in the literature is high susceptibility to noise and outliers with abnormal values. However, the
results also reveal a lack of significant research on evaluation parameters related to cluster distance,
the Davies-Bouldin Index, epsilon value and neighbourhood size. We encourage researchers to consider
clustering algorithms for analysing medical data to improve the management of medical data.